The goal of this project is to make a recommendation for the launch of an online book group. We want to know which types of books tend to lead to the most engaged readers. The books should both appeal to a large volume of readers and tend to drive discussion among readers, since these qualities would support the group's success.
From an informal poll of my network I've formed a hypothesis that fiction books would lend themselves better to an online book group than non-fiction. Now I want to examine a large dataset that will either lead me to support or reject that hypothesis. I also want to learn if there are more specific categories of books that would lead to more interest.
I'm using a dataset from Kaggle that compiles data for 100k books scraped from Goodreads. Goodreads is an extensive database of books that allows users to rate and review books, and to generate reading lists. As of 2022 the site reported 90 million registered members so we can assume that it reaches a broad audience of book readers.
We're making a couple of important assumptions:
First we'll import the Python libraries, then load the csv file into a dataframe, view a few rows, and generate some descriptive stats about the data. Next we'll determine which columns to focus on and what steps we should take to clean the data.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
data = pd.read_csv('GoodReads_100k_books.csv')
#View how many rows, columns, and non-null values exist in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 100000 entries, 0 to 99999 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 author 100000 non-null object 1 bookformat 96772 non-null object 2 desc 93228 non-null object 3 genre 89533 non-null object 4 img 96955 non-null object 5 isbn 85518 non-null object 6 isbn13 88565 non-null object 7 link 100000 non-null object 8 pages 100000 non-null int64 9 rating 100000 non-null float64 10 reviews 100000 non-null int64 11 title 99999 non-null object 12 totalratings 100000 non-null int64 dtypes: float64(1), int64(3), object(9) memory usage: 9.9+ MB
#Generate descriptive stats about the numeric columns
data.describe()
| pages | rating | reviews | totalratings | |
|---|---|---|---|---|
| count | 100000.000000 | 100000.000000 | 100000.000000 | 1.000000e+05 |
| mean | 255.010240 | 3.833055 | 181.528450 | 2.990764e+03 |
| std | 367.913582 | 0.621237 | 1449.451229 | 3.635338e+04 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 |
| 25% | 135.000000 | 3.660000 | 3.000000 | 3.100000e+01 |
| 50% | 240.000000 | 3.910000 | 15.000000 | 1.460000e+02 |
| 75% | 336.000000 | 4.140000 | 67.000000 | 7.440000e+02 |
| max | 70000.000000 | 5.000000 | 158776.000000 | 3.819326e+06 |
#View the first three rows
data.head(3)
| author | bookformat | desc | genre | img | isbn | isbn13 | link | pages | rating | reviews | title | totalratings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Laurence M. Hauptman | Hardcover | Reveals that several hundred thousand Indians ... | History,Military History,Civil War,American Hi... | https://i.gr-assets.com/images/S/compressed.ph... | 002914180X | 9.78E+12 | https://goodreads.com/book/show/1001053.Betwee... | 0 | 3.52 | 5 | Between Two Fires: American Indians in the Civ... | 33 |
| 1 | Charlotte Fiell,Emmanuelle Dirix | Paperback | Fashion Sourcebook - 1920s is the first book i... | Couture,Fashion,Historical,Art,Nonfiction | https://i.gr-assets.com/images/S/compressed.ph... | 1906863482 | 9.78E+12 | https://goodreads.com/book/show/10010552-fashi... | 576 | 4.51 | 6 | Fashion Sourcebook 1920s | 41 |
| 2 | Andy Anderson | Paperback | The seminal history and analysis of the Hungar... | Politics,History | https://i.gr-assets.com/images/S/compressed.ph... | 948984147 | 9.78E+12 | https://goodreads.com/book/show/1001077.Hungar... | 124 | 4.15 | 2 | Hungary 56 | 26 |
These are the findings from the initial exploration of the data:
Let's look at the total and unique values in the non-numeric columns and determine how we should clean the data.
#View the total and unique values for the non-numeric columns
data.describe(exclude="number").iloc[0:2,:]
| author | bookformat | desc | genre | img | isbn | isbn13 | link | title | |
|---|---|---|---|---|---|---|---|---|---|
| count | 100000 | 96772 | 93228 | 89533 | 96955 | 85518 | 88565 | 100000 | 99999 |
| unique | 68767 | 202 | 92499 | 72129 | 96955 | 85518 | 725 | 100000 | 97588 |
These are the findings around data cleaning:
Now that we know how to clean the data, we'll drop the columns we don't need, reorder the remaining columns, and then see if there are any duplicate titles. Because we have some missing values in the ISBN column, we'll consider an entry to be potentially duplicated if the title, author, ISBN and format all match.
#Drop unneeded columns
data_clean = data.copy()
data_clean.drop(['desc', 'img', 'link', 'pages', 'isbn13'], axis=1, inplace=True)
#Reorder remaining columns
data_clean = data_clean[['title', 'author','bookformat', 'isbn', 'genre', 'rating', 'totalratings', 'reviews']]
#Find potential duplicate books
titles_duplicated = data_clean[data_clean.duplicated(['title', 'isbn', 'author', 'bookformat'], keep=False)]
titles_duplicated.head(10)
| title | author | bookformat | isbn | genre | rating | totalratings | reviews | |
|---|---|---|---|---|---|---|---|---|
| 5978 | Kiroileva siili | Milla Paloniemi | Hardcover | NaN | Sequential Art,Comics,European Literature,Finn... | 3.58 | 124 | 5 |
| 6055 | Kiroileva siili | Milla Paloniemi | Hardcover | NaN | Sequential Art,Comics,European Literature,Finn... | 3.64 | 75 | 5 |
| 22069 | Savage Garden | Lee Hyeon-sook | Paperback | NaN | Sequential Art,Manga,Manga,Manhwa,Sequential A... | 4.06 | 138 | 10 |
| 22072 | Savage Garden | Lee Hyeon-sook | Paperback | NaN | Sequential Art,Manga,Manga,Manhwa,Sequential A... | 4.04 | 89 | 3 |
| 22073 | Savage Garden | Lee Hyeon-sook | Paperback | NaN | Sequential Art,Manga,Manga,Manhwa,Sequential A... | 4.07 | 86 | 3 |
| 44987 | Honggane | Mi-Ri Hwang | NaN | NaN | Sequential Art,Manga,Manga,Manhwa,Romance,Mang... | 4.02 | 123 | 2 |
| 44991 | Honggane | Mi-Ri Hwang | NaN | NaN | Sequential Art,Manga,Manga,Manhwa,Romance,Mang... | 4.13 | 80 | 1 |
| 44993 | Honggane | Mi-Ri Hwang | NaN | NaN | Sequential Art,Manga,Manga,Manhwa,Romance,Mang... | 4.13 | 77 | 1 |
| 44995 | Honggane | Mi-Ri Hwang | NaN | NaN | Sequential Art,Manga,Manga,Manhwa,Romance,Mang... | 4.11 | 79 | 1 |
| 44996 | Honggane | Mi-Ri Hwang | NaN | NaN | Sequential Art,Manga,Manga,Manhwa,Romance,Mang... | 4.04 | 73 | 1 |
These look like duplicate titles, but they have different numbers of ratings and reviews. Perhaps they were scraped at different times. Let's remove the duplicates, and keep the entry with the most ratings.
#Remove the duplicate values, keeping the value with the most ratings
data_clean = data_clean.sort_values(by = ['totalratings'], ascending = False)
data_clean = data_clean.drop_duplicates(subset=['title', 'author', 'bookformat', 'isbn'], keep='first')
Which book is missing a title?
data_clean[data_clean['title'].isnull()]
| title | author | bookformat | isbn | genre | rating | totalratings | reviews | |
|---|---|---|---|---|---|---|---|---|
| 54953 | NaN | Jacqui Malpass | NaN | NaN | Diary,Journaling | 3.33 | 9 | 3 |
Since neither the title nor the ISBN is available on Goodreads we'll remove this from the dataset.
#Remove the book with no title from the dataset
data_clean = data_clean.dropna(subset=['title'])
#View how many rows are left in the dataset
data_clean.shape
(99938, 8)
These two steps eliminated 62 rows. Now let's take a look at what the book formats are.
#Display the number of books for each format and display the top 20
data_clean.groupby(['bookformat']).size().sort_values(ascending=False).head(20)
bookformat Paperback 53845 Hardcover 27579 ebook 5328 Kindle Edition 4565 Mass Market Paperback 3024 Unknown Binding 449 Nook 395 Audio CD 318 Board Book 164 Spiral-bound 137 Audiobook 105 Library Binding 69 Board book 65 paperback 38 Audio 35 Trade Paperback 34 Audio Cassette 29 Leather Bound 29 Comics 24 Cards 22 dtype: int64
We'll clean these up by changing the case to lower for consistency, grouping some of the values together (like "mass market paperback" and "trade paperback") and grouping some of the little-used formats into an "other" category.
#Clean the bookformat column
data_clean['bookformat'] = data_clean['bookformat'].str.lower()
data_clean['bookformat'] = data_clean['bookformat'].replace(['kindle edition', 'nook'],'ebook')
data_clean['bookformat'] = data_clean['bookformat'].replace(['mass market paperback', 'softcover', 'trade paperback'],'paperback')
data_clean['bookformat'] = data_clean['bookformat'].replace(['hardback'],'hardcover')
data_clean['bookformat'] = data_clean['bookformat'].replace(['audio cd', 'audio', 'audio cassette', 'audio play','audible audio'],'audiobook')
data_clean['bookformat'] = np.where(data_clean['bookformat'].isin(['paperback','hardcover','ebook','audiobook']), data_clean['bookformat'], 'other')
#Display the total counts and normalized counts for the cleaned formats
formats = pd.concat([data_clean['bookformat'].value_counts(), data_clean['bookformat'].value_counts(normalize=True)], axis=1, keys=('counts','percentage'))
#Format the numeric columns
formats.style.format({'counts':'{:,.0f}',
'percentage':'{:.0%}'})
| counts | percentage | |
|---|---|---|
| paperback | 56,949 | 57% |
| hardcover | 27,613 | 28% |
| ebook | 10,288 | 10% |
| other | 4,577 | 5% |
| audiobook | 511 | 1% |
We can see that paperbacks are by far the most common format on the platform, but hardcovers are significant. Ebooks, Audiobooks, and other formats are far less common.
Now we'll move on the cleaning the genre column. First, we have a number of null values to consider.
#See how many null values there are in the genre column
data_clean['genre'].isnull().sum()
10467
There are 10,467 rows that don't have the genre filled in at all, so we don't know if those books are fiction or non-fiction. We'll fill these rows with "no genre" so they will be considered as a group in the final analysis. If we see high reviews and high ratings for this group it may be worth doing further exploration to see if we can get genre information for these titles.
#Fill null values in the genre column with "no genre"
data_clean['genre'].fillna("no genre", inplace = True)
Next, we'll do the following steps:
#Change the genre column to lowercase
data_clean['genre'] = data_clean['genre'].str.lower()
#Add a column called new_genre which is a copy of genre
data_clean['new_genre'] = data_clean['genre']
#Clean up the variations on "nonfiction"
data_clean['new_genre'] = data_clean['new_genre'].str.replace('non fiction', 'nonfiction')
#Create a new column called supergenre and set the value to "no value"
data_clean['supergenre'] = 'no value'
#When the new_genre column contains "nonfiction" set the supergenre value to "non-fiction"
data_clean.loc[data_clean['new_genre'].str.contains('nonfiction'),'supergenre'] = 'non-fiction'
#Count up the books classified as "no value" and "nonfiction"
data_clean['supergenre'].value_counts()
no value 70459 non-fiction 29479 Name: supergenre, dtype: int64
Now add some logic so if "fiction" appears anywhere in new_genre and is not preceded by "non" the supergenre will be "fiction" (such as in the case of "historical fiction").
#If the new_genre column contains "fiction" not preceded by "non" set the supergenre value to "fiction"
pattern = '(?<!non)fiction'
data_clean.loc[data_clean['new_genre'].str.contains(pattern),'supergenre'] = 'fiction'
#Display the book counts for the supergenre column
data_clean['supergenre'].value_counts()
fiction 36480 no value 35060 non-fiction 28398 Name: supergenre, dtype: int64
Adding the fiction categorization reduced the number of non-fiction books from 29,479 to 28,398. So it appears that there are some books that have both "fiction" and "non-fiction" in the new_genre column. Let's take a closer look at what those are.
#Display books that contain both "fiction" and "nonfiction" in the new_genre column
pd.set_option('display.max_colwidth', None)
bool = (data_clean['new_genre'].str.contains('nonfiction')) & data_clean['new_genre'].str.contains('(?<!non)fiction')
data_clean[bool].head(10)
| title | author | bookformat | isbn | genre | rating | totalratings | reviews | new_genre | supergenre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 8091 | The Art of War | Sun Tzu,Thomas Cleary,Pulat Otkan | paperback | NaN | nonfiction,classics,philosophy,history,war,business,politics,war,military fiction,cultural,china,leadership | 3.97 | 360348 | 12237 | nonfiction,classics,philosophy,history,war,business,politics,war,military fiction,cultural,china,leadership | fiction |
| 10159 | 1776 | David McCullough | paperback | 743226720 | history,nonfiction,north american hi...,american history,historical,military history,american revolution,war,audiobook,american revolution,american revolutionary war,politics,war,military fiction | 4.06 | 201109 | 7468 | history,nonfiction,north american hi...,american history,historical,military history,american revolution,war,audiobook,american revolution,american revolutionary war,politics,war,military fiction | fiction |
| 25441 | The Hunger Games: Official Illustrated Movie Companion | Kate Egan | paperback | 545422906 | nonfiction,science fiction,dystopia,media tie in,adventure,culture,film,young adult,teen,romance,love,thriller,romance,m m romance | 4.52 | 175110 | 412 | nonfiction,science fiction,dystopia,media tie in,adventure,culture,film,young adult,teen,romance,love,thriller,romance,m m romance | fiction |
| 59863 | Julie and Julia: 365 Days, 524 Recipes, 1 Tiny Apartment Kitchen | Julie Powell | hardcover | 031610969X | nonfiction,autobiography,memoir,food and drink,food,food and drink,cooking,biography,womens fiction,chick lit,biography memoir,humor,food and drink,cookbooks,food and drink,foodie | 3.70 | 157546 | 7985 | nonfiction,autobiography,memoir,food and drink,food,food and drink,cooking,biography,womens fiction,chick lit,biography memoir,humor,food and drink,cookbooks,food and drink,foodie | fiction |
| 27395 | American Sniper: The Autobiography of the Most Lethal Sniper in U.S. Military History | Chris Kyle,Scott McEwen,Jim DeFelice | paperback | 62107062 | nonfiction,biography,war,military fiction,history,war,autobiography,memoir,biography,autobiography,audiobook,military,military history,biography memoir | 4.00 | 116982 | 8126 | nonfiction,biography,war,military fiction,history,war,autobiography,memoir,biography,autobiography,audiobook,military,military history,biography memoir | fiction |
| 6265 | He's Just Not That Into You: The No-Excuses Truth to Understanding Guys | Greg Behrendt,Liz Tuccillo | hardcover | 141694740X | nonfiction,self help,womens fiction,chick lit,relationships,humor,psychology,romance,adult,contemporary,humor,comedy | 3.66 | 64018 | 3470 | nonfiction,self help,womens fiction,chick lit,relationships,humor,psychology,romance,adult,contemporary,humor,comedy | fiction |
| 584 | Flags of Our Fathers | James D. Bradley,Ron Powers | paperback | 553384155 | history,nonfiction,war,military fiction,war,war,world war ii,biography,military,military history,north american hi...,american history,historical,adult | 4.19 | 53110 | 1644 | history,nonfiction,war,military fiction,war,war,world war ii,biography,military,military history,north american hi...,american history,historical,adult | fiction |
| 79186 | No Easy Day: The Firsthand Account of the Mission That Killed Osama Bin Laden | Mark Owen,Kevin Maurer | hardcover | 525953728 | nonfiction,war,military fiction,history,war,biography,autobiography,memoir,military,military history,audiobook,politics,biography,autobiography | 3.99 | 48049 | 3978 | nonfiction,war,military fiction,history,war,biography,autobiography,memoir,military,military history,audiobook,politics,biography,autobiography | fiction |
| 24665 | Physics of the Impossible | Michio Kaku | hardcover | 385520697 | science,nonfiction,science,physics,science,popular science,science,technology,science,astronomy,space,science fiction,time travel,audiobook,reference | 4.07 | 35258 | 1358 | science,nonfiction,science,physics,science,popular science,science,technology,science,astronomy,space,science fiction,time travel,audiobook,reference | fiction |
| 16161 | The Looming Tower: Al-Qaeda and the Road to 9/11 | Lawrence Wright | hardcover | 037541486X | history,nonfiction,politics,war,terrorism,war,religion,north american hi...,american history,religion,islam,audiobook,war,military fiction | 4.33 | 27328 | 2435 | history,nonfiction,politics,war,terrorism,war,religion,north american hi...,american history,religion,islam,audiobook,war,military fiction | fiction |
We see that these books include biographies, military history, self-help and other topics we normally find in the non-fiction section, so we'll relabel these non-fiction.
#If the book contains both "fiction" and "nonfiction" in the new_genre column label it "non-fiction" in the supergenre column
data_clean.loc[bool, 'supergenre'] = 'non-fiction'
data_clean[bool].head(10)
#Display the book counts for the supergenre column
data_clean['supergenre'].value_counts()
fiction 35399 no value 35060 non-fiction 29479 Name: supergenre, dtype: int64
We can see that for about a third of our books we don't know if it should be categorized as fiction or non-fiction. We'll have to see if we can categorize them based on other information in the genre column. First we need to transform the new_genre column from strings to lists. Then we'll explode the new genre column which will create one row for every genre in the list. Consequently, a single book can have multiple rows corresponding to each genre listed in its new genres column.
#Split the new_genre string into a list on the comma
data_clean['new_genre'] = data_clean['new_genre'].str.split(",")
#Create a new dataframe with the exploded genres, and show the first 10 rows
exploded_genre = data_clean.explode('new_genre')
exploded_genre.head(10)
| title | author | bookformat | isbn | genre | rating | totalratings | reviews | new_genre | supergenre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | young adult | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | romance | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | fiction | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | contemporary | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | realistic fiction | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | young adult | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | teen | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | young adult | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | coming of age | fiction |
| 27180 | The Fault in Our Stars | John Green | hardcover | NaN | young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love | 4.19 | 3819326 | 158776 | novels | fiction |
Let's verify that this worked by checking a single ISBN in the old and the new dataframe. In the old data we expect to see one row per ISBN.
pd.set_option("display.max_colwidth", None)
data_clean[data_clean['isbn'] == '002914180X']
| title | author | bookformat | isbn | genre | rating | totalratings | reviews | new_genre | supergenre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | [history, military history, civil war, american history, american civil war, nonfiction, north american hi..., american history, native americans] | non-fiction |
In the new dataframe we expect to see 9 rows, one for each genre in the new_genre column, which is what we're seeing.
exploded_genre[exploded_genre['isbn'] == '002914180X']
| title | author | bookformat | isbn | genre | rating | totalratings | reviews | new_genre | supergenre | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | history | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | military history | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | civil war | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | american history | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | american civil war | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | nonfiction | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | north american hi... | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | american history | non-fiction |
| 0 | Between Two Fires: American Indians in the Civil War | Laurence M. Hauptman | hardcover | 002914180X | history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans | 3.52 | 33 | 5 | native americans | non-fiction |
We can now quickly see if some books had duplicate genres and remove duplicate rows, so that each book, format, author and ISBN has only one row for each applicable genre.
#Check the number of rows in the exploded dataset
exploded_genre.shape
(831056, 10)
#Remove duplicate rows where title, author, format, ISBN, and genre are all the same
exploded_genre[exploded_genre.duplicated(subset=['title', 'author', 'bookformat', 'isbn','new_genre'], keep=False)]
exploded_genre = exploded_genre.sort_values(by = ['totalratings'], ascending = False)
exploded_genre = exploded_genre.drop_duplicates(subset=['title', 'author', 'bookformat', 'isbn', 'new_genre'], keep='first')
#Check the number of rows that remain
exploded_genre.shape
(678793, 10)
This step removed 152,263 rows. Now let's see how many fiction and non-fiction books we have, and how their average ratings compare, by aggregating data in our unexploded dataset (which has one entry per book).
#Find the counts, average rating, average ratings, and average reviews for each supergenre
supergenre_summary = data_clean.groupby('supergenre').agg(genre_count=('supergenre', 'count'),
avg_rating=('rating', 'mean'),
avg_ratings=('totalratings', 'mean'),
avg_reviews=('reviews', 'mean'))
supergenre_summary.sort_values(by = 'genre_count', ascending=False)
#Format the numeric columns
supergenre_summary.style.format({'genre_count':'{:,.0f}',
'avg_rating':'{:,.2f}',
'avg_ratings':'{:,.0f}',
'avg_reviews':'{:,.0f}'})
| genre_count | avg_rating | avg_ratings | avg_reviews | |
|---|---|---|---|---|
| supergenre | ||||
| fiction | 35,399 | 3.85 | 7,059 | 415 |
| no value | 35,060 | 3.71 | 197 | 15 |
| non-fiction | 29,479 | 3.95 | 1,435 | 100 |
There are a lot of books with "no value" for the supergenre which means it didn't have a value of "fiction" or "non-fiction" in the new_genres column so we don't know how to classify it. Let's see if we can classify some of these by looking at what the most common genres are when the supergenre is "no value."
#View the most popular new_genre when the supergenre = no value
missing_genre_summary = exploded_genre[exploded_genre['supergenre'] == 'no value'].groupby('new_genre').agg(genre_count=('new_genre', 'count'),
avg_rating=('rating', 'mean'),
avg_ratings=('totalratings', 'mean'),
avg_reviews=('reviews', 'mean'))
missing_genre_summary = missing_genre_summary.sort_values(by = 'genre_count', ascending=False)
#Display the top 30 results and format the numeric columns
missing_genre_summary.head(30).style.format({'genre_count':'{:,.0f}',
'avg_rating':'{:,.2f}',
'avg_ratings':'{:,.0f}',
'avg_reviews':'{:,.0f}'})
| genre_count | avg_rating | avg_ratings | avg_reviews | |
|---|---|---|---|---|
| new_genre | ||||
| no genre | 10,467 | 3.34 | 10 | 1 |
| romance | 4,569 | 3.79 | 1,137 | 81 |
| fantasy | 2,909 | 3.89 | 1,512 | 102 |
| sequential art | 2,280 | 3.92 | 422 | 16 |
| history | 2,035 | 3.90 | 46 | 5 |
| cultural | 1,921 | 3.88 | 143 | 12 |
| childrens | 1,813 | 3.79 | 155 | 19 |
| paranormal | 1,695 | 3.87 | 2,375 | 164 |
| religion | 1,617 | 4.06 | 70 | 8 |
| science | 1,464 | 3.91 | 63 | 7 |
| comics | 1,388 | 3.90 | 529 | 20 |
| manga | 1,359 | 4.01 | 565 | 18 |
| european literature | 1,340 | 3.80 | 145 | 9 |
| contemporary | 1,303 | 3.75 | 881 | 77 |
| graphic novels | 1,132 | 3.94 | 726 | 26 |
| picture books | 1,091 | 3.77 | 132 | 22 |
| category romance | 1,075 | 3.57 | 123 | 14 |
| reference | 1,065 | 4.01 | 74 | 4 |
| harlequin | 1,034 | 3.51 | 124 | 14 |
| m m romance | 1,017 | 3.74 | 536 | 57 |
| young adult | 998 | 3.97 | 2,704 | 172 |
| games | 965 | 3.84 | 56 | 2 |
| lgbt | 925 | 3.83 | 583 | 58 |
| art | 923 | 4.04 | 80 | 6 |
| paranormal romance | 898 | 3.91 | 3,773 | 259 |
| poetry | 884 | 4.01 | 180 | 15 |
| contemporary romance | 843 | 3.66 | 802 | 70 |
| animals | 727 | 3.89 | 459 | 37 |
| philosophy | 719 | 4.02 | 62 | 6 |
| magic | 703 | 3.94 | 2,958 | 194 |
We'll remove the "no genre" books from our dataset because we don't have a way to get their genre, they only averaged about 1 review per title. We're most interested in the books that have high reviews. We'll remove all children's and picture books from the dataset since our group won't be for children. Other genres we can verify by checking them here. Anything with a genre of fantasy, sequential art, comics, manga, graphic novels, we'll categorize as fiction. And if the genre contains romance we'll also categorize these as fiction, because the romance genres appear under fiction on the Goodreads genre list.
#Remove childrens and "no genre" books and categorize comics/graphic novels and fantasy as fiction
data_clean_genre = data_clean[data_clean['genre'].str.contains('no genre|childrens|picture books', regex=True)==False]
data_clean_genre.loc[data_clean['genre'].str.contains('romance|fantasy|sequential art|comics|manga|graphic novels', regex=True),'supergenre'] = 'fiction'
Now that we've categorized the books by supergenre as best we can and cleaned up the genre column we can move on to the final analysis and visualizations.
Let's visualize our cleaned-up dataset to see whether fiction or non-fiction books tend to get more engagement, which we'll quantify by looking at the number of reviews per title for each.
#Create a plotly box plot of reviews per title in our cleaned dataset
fig = px.box(data_clean_genre[data_clean_genre['supergenre'] != 'no value'], x="supergenre", y="reviews",
range_y=[0,450],
points='outliers',
title='Fiction Books Get More Reviews Than Non-Fiction',)
fig.show()
The box plots show is that fiction books tend to get much higher numbers of reviews. The median for fiction is 46, while the median for non-fiction is 17. The 3rd quartile for fiction is 167 while for non-fiction it's 52.
This also shows us there are many outliers for both groups so when evaluating the genres and supergenres we'll use the median ratings and median reviews, because the median is less likely to be skewed by extreme outliers than the average.
Next we'll show the results in chart format, specifically the median number of ratings and reviews. For the 5-star rating we'll use the average (since the rating only goes up to 5) but if the book received fewer than 20 ratings we'll exclude it from the average star rating. We'll also calculate a new column for review ratio, which is the total number of reviews divided by the total number of ratings. This tells us, for users who rated a book, how likely were they to also leave a review.
#Add a column for review ratio
data_clean_genre_with_new_col = data_clean_genre.copy()
data_clean_genre_with_new_col['review_ratio'] = data_clean_genre_with_new_col['reviews'] / data_clean_genre_with_new_col['totalratings']
#Generate numerical chart grouped by supergenre with counts, median ratings, and median reviews
supergenre_summary = data_clean_genre_with_new_col.groupby('supergenre').agg(genre_count=('supergenre', 'count'),
median_ratings=('totalratings', 'median'),
median_reviews=('reviews', 'median'),
median_review_ratio=('review_ratio', 'median'))
supergenre_summary.reset_index()
#Calculate the average star rating by supergenre excluding books with less than 20 ratings.
data_clean_ratings = data_clean_genre_with_new_col[data_clean_genre_with_new_col['totalratings'] >= 20]
supergenre_summary_ratings = data_clean_ratings.groupby('supergenre').agg(avg_star_rating=('rating', 'mean'))
supergenre_summary = supergenre_summary.merge(supergenre_summary_ratings, on='supergenre', how='left')
#Format the numeric columns
format_dict = {'genre_count':'{:,.0f}'
,'median_ratings':'{:,.0f}'
,'median_reviews':'{:,.0f}'
,'avg_star_rating':'{:,.3f}'
,'median_review_ratio':'{:,.3f}'
}
#Reorder the columns, sort by median reviews and format the numeric columns
supergenre_summary = supergenre_summary[['genre_count',
'median_ratings',
'median_reviews',
'avg_star_rating',
'median_review_ratio']]
supergenre_summary = supergenre_summary.sort_values('median_reviews', ascending=False).reset_index()
supergenre_summary.style.hide(axis="index").format(format_dict)
| supergenre | genre_count | median_ratings | median_reviews | avg_star_rating | median_review_ratio |
|---|---|---|---|---|---|
| fiction | 38,524 | 547 | 46 | 3.844 | 0.088 |
| non-fiction | 27,110 | 163 | 17 | 3.956 | 0.104 |
| no value | 15,974 | 23 | 2 | 3.949 | 0.083 |
The results support our initial hypothesis that fiction had more appeal for an online book group. Fiction has roughly 3x higher median ratings and reviews than non-fiction. The average star rating was only 3% lower than non-fiction so readers rated them similarly in terms of satisfaction.
The median review ratio shows that for readers of non-fiction, they may be 18% more likely to leave a review if they rated the book. This may be because non-fiction has smaller, more niche audiences that are a bit more engaged, but since fiction attracts a much bigger audience, we'll still recommend focusing on fiction books.
The numerical chart also shows that we've reduced the number of titles with "no value" for fiction or non-fiction from 35% to 20% of our dataset. These titles tend to have low ratings and reviews, so we'll ignore them for the time being.
Now we want to see whether the format of the book has any bearing on how successfully it attracts reviews. We'll focus on fiction books and on the four most popular formats (hardcover, ebook, paperback, and audiobook), summarize the results and then visualize the median reviews for each format.
#Calculate the counts, median reviews, median ratings, and median review ratio by format
supergenre_format_summary = data_clean_genre_with_new_col.groupby(['supergenre', 'bookformat']).agg(genre_count=('supergenre', 'count'),
median_ratings=('totalratings', 'median'),
median_reviews=('reviews', 'median'),
median_review_ratio=('review_ratio', 'median'))
supergenre_format_summary.reset_index()
#Calculate the average star rating by supergenre excluding books with less than 20 ratings.
supergenre_format_summary_ratings = data_clean_ratings.groupby(['supergenre','bookformat']).agg(avg_star_rating=('rating', 'mean'))
#Merge the average star rating to the other data
supergenre_format_summary = supergenre_format_summary.merge(supergenre_format_summary_ratings, on=['supergenre', 'bookformat'], how='left')
#Reorder columns, sort by median reviews, and filter for popular fiction formats
supergenre_format_summary = supergenre_format_summary[['genre_count', 'median_ratings', 'median_reviews', 'avg_star_rating', 'median_review_ratio']]
supergenre_format_summary = supergenre_format_summary.sort_values(by='median_reviews', ascending=False).reset_index()
supergenre_format_summary = supergenre_format_summary[(supergenre_format_summary['bookformat'] != 'other') & (supergenre_format_summary['supergenre'] == 'fiction')]
supergenre_format_summary.style.hide(axis="index").format(format_dict)
| supergenre | bookformat | genre_count | median_ratings | median_reviews | avg_star_rating | median_review_ratio |
|---|---|---|---|---|---|---|
| fiction | hardcover | 6,686 | 1,101 | 114 | 3.823 | 0.105 |
| fiction | ebook | 7,442 | 548 | 50 | 3.815 | 0.092 |
| fiction | paperback | 22,780 | 482 | 40 | 3.856 | 0.083 |
| fiction | audiobook | 290 | 272 | 24 | 3.773 | 0.092 |
#Create a bar chart of the median reviews from the format summary table
df = supergenre_format_summary
fig = px.bar(df, x="median_reviews", y="bookformat", orientation='h',
hover_data={'median_reviews':True,
'bookformat':False},
hover_name='bookformat',
title='Hardcover Fiction Books Get More Reviews Than Other Formats',
labels={'median_reviews': 'median reviews',
'bookformat': 'format'}
)
fig.show()
While paperbacks outnumber hardcovers 3 to 1, hardcover books have 3x higher median reviews and 2x higher ratings. Their review ratio is the highest of all types, meaning audiences that read and rate are also the most likely to leave a review as well.
The difference in average star rating was very slight, so readers were about as satisfied regardless of format.
These results may indicate that books tend to get higher reviews when they're new since publisher typically release books in hardcover first. They may speak to the higher quality of books that appear in hardcover format since publishers put their highest-priority titles in hardcover.
Now we want to examine whether there are specific genres of fiction books that tend to attract the greatest number of reviews. First we'll look at how many subgenres exist within fiction.
#Filter our cleaned dataset for fiction
data_clean_fiction = data_clean_genre_with_new_col[data_clean_genre['supergenre'] == 'fiction']
#Explode the new_genre column, allowing us to count the genres
exploded_data_clean_fiction = data_clean_fiction.explode('new_genre')
exploded_data_clean_fiction = exploded_data_clean_fiction.sort_values(by = ['totalratings'], ascending = False)
#Drop duplicate genres from the exploded dataset
exploded_data_clean_fiction = exploded_data_clean_fiction.drop_duplicates(subset=['title', 'author', 'bookformat', 'isbn', 'new_genre'], keep='first')
#Calculate the book counts, median ratings, median reviews, and median ratings ratio for each genre
genre_summary = exploded_data_clean_fiction.groupby('new_genre').agg(genre_count=('new_genre', 'count'),
median_ratings=('totalratings', 'median'),
median_reviews=('reviews', 'median'),
median_review_ratio=('review_ratio', 'median'))
genre_summary.sort_values(by = 'genre_count', ascending=False).reset_index()
genre_summary = genre_summary[['genre_count', 'median_ratings', 'median_reviews', 'median_review_ratio']]
#Summarize the genre results
genre_summary.describe()
| genre_count | median_ratings | median_reviews | median_review_ratio | |
|---|---|---|---|---|
| count | 959.000000 | 959.000000 | 959.000000 | 959.000000 |
| mean | 364.544317 | 872.965589 | 79.163191 | 0.103770 |
| std | 1373.989887 | 1413.337788 | 125.915635 | 0.049932 |
| min | 1.000000 | 16.000000 | 1.000000 | 0.008403 |
| 25% | 5.500000 | 269.000000 | 26.000000 | 0.073956 |
| 50% | 36.000000 | 572.000000 | 51.000000 | 0.096338 |
| 75% | 200.000000 | 1060.750000 | 93.000000 | 0.123965 |
| max | 25103.000000 | 32453.000000 | 2901.000000 | 0.522013 |
We have 959 genres which is quite a lot, and some of these have as few as one title within them. We want to evaluate genres that have a large number of titles because a larger sample will give us more reliable results, and we also want to be sure there's a large number of potential titles to pick for our book group. To get a sense of how many titles we should be looking for we'll look at the books counts for the genres in the top percentiles.
#Show the book counts for the 75th-99th percentiles
n = [75, 80, 85, 90, 95, 99]
for i in n:
print("The " + str(i) + "th percentile book count is: ", round(np.percentile(genre_summary['genre_count'], i),0))
The 75th percentile book count is: 200.0 The 80th percentile book count is: 263.0 The 85th percentile book count is: 401.0 The 90th percentile book count is: 636.0 The 95th percentile book count is: 1562.0 The 99th percentile book count is: 5733.0
To simplify this dataset of 959 genres, we'll pick a round number between the 90th-95th percentile and say the genre needs to have at least 1,000 books. Next we'll calculate the average star rating per genre like we did before, by counting the book if it had at least 20 reviews.
#Calculate the average star rating for each genre, including books with at least 20 ratings
rating_summary = exploded_data_clean_fiction[exploded_data_clean_fiction['totalratings'] >= 20].groupby('new_genre').agg(avg_star_rating=('rating', 'mean'))
rating_summary = rating_summary.sort_values(by = 'avg_star_rating', ascending=False).reset_index()
#Merger the star rating with the other data
genre_summary = genre_summary.merge(rating_summary, on='new_genre', how='left')
#Reorder the columns
genre_summary = genre_summary[['new_genre', 'genre_count', 'median_ratings', 'median_reviews', 'avg_star_rating', 'median_review_ratio']]
#Filter the genres for only those with at least 1000 books in the dataset
genre_summary_top = genre_summary[genre_summary['genre_count'] >= 1000].sort_values(by='genre_count', ascending=False)
genre_summary_top.describe()
| genre_count | median_ratings | median_reviews | avg_star_rating | median_review_ratio | |
|---|---|---|---|---|---|
| count | 74.000000 | 74.000000 | 74.000000 | 74.000000 | 74.000000 |
| mean | 3437.783784 | 1136.939189 | 93.912162 | 3.865502 | 0.084631 |
| std | 3748.518221 | 760.327633 | 68.972019 | 0.086889 | 0.017961 |
| min | 1001.000000 | 132.000000 | 15.000000 | 3.597572 | 0.031645 |
| 25% | 1342.000000 | 537.000000 | 49.250000 | 3.813099 | 0.074739 |
| 50% | 2405.500000 | 985.000000 | 73.000000 | 3.854770 | 0.085250 |
| 75% | 4083.750000 | 1501.000000 | 117.750000 | 3.917580 | 0.094691 |
| max | 25103.000000 | 3614.000000 | 314.000000 | 4.146836 | 0.120338 |
This narrows down the group to 74 genres, which is still a large number. We'll look at only the ones that are in the 75th percentile for ratings and reviews, and the 50th percentile for rating ratio, from this group. There isn't much variation in the star rating (the standard deviation is only .09) so we'll just focus on those three criteria.
We'll also remove some genres that aren't terribly meaningful from the result set like "novels," "adult," "audiobook," and "american."
#Further filter genre list for the top in median reviews, median ratings, and median ratings ratio
recommended = (genre_summary_top['median_reviews'] >= 118) & (genre_summary_top['median_ratings'] >= 1501) & (genre_summary_top['median_review_ratio'] >= .085) & (genre_summary_top['new_genre'] != 'novels') & (genre_summary_top['new_genre'] != 'audiobook') & (genre_summary_top['new_genre'] != 'adult') & (genre_summary_top['new_genre'] != 'american')
genre_summary_top_recommended = genre_summary_top[recommended].sort_values(by='median_reviews', ascending = False)
#Format the numeric columns
genre_summary_top_recommended.style.hide(axis="index").format(format_dict)
| new_genre | genre_count | median_ratings | median_reviews | avg_star_rating | median_review_ratio |
|---|---|---|---|---|---|
| teen | 1,001 | 2,719 | 314 | 3.792 | 0.120 |
| womens fiction | 2,068 | 3,243 | 278 | 3.814 | 0.085 |
| mystery thriller | 2,416 | 2,461 | 212 | 3.859 | 0.089 |
| literary fiction | 1,622 | 1,742 | 201 | 3.742 | 0.117 |
| dystopia | 1,173 | 1,896 | 198 | 3.846 | 0.108 |
| suspense | 2,395 | 2,351 | 190 | 3.901 | 0.085 |
| thriller | 3,873 | 2,019 | 171 | 3.849 | 0.088 |
| young adult | 5,037 | 1,849 | 153 | 3.899 | 0.092 |
| crime | 2,973 | 1,817 | 151 | 3.858 | 0.087 |
| british literature | 1,698 | 1,513 | 145 | 3.801 | 0.093 |
#Create a bubble chart of the top genres plotted with median reviews on x-axis, median ratings on y-axis, and book count for bubble size
df = genre_summary_top_recommended
fig = px.scatter(df, x="median_reviews", y="median_ratings",
color="new_genre",
size='genre_count',
hover_data={'new_genre':False,
'median_reviews':True,
'median_ratings':True,
'genre_count':False},
hover_name='new_genre',
range_x=[100,350],
range_y=[100,3500],
title='Top Fiction Genres For Median Reviews',
labels={'median_ratings': 'median ratings',
'median_reviews': 'median reviews',
'new_genre': 'genre'},
text='new_genre',
size_max=40,
opacity=.5)
fig.update_layout(
font=dict(
size=10
)
)
fig.show()
Based on these results a few themes emerge:
Fiction books for teens aged 13 to 18 was a breakout fiction genre with the highest median reviews. It also had the highest review-to-rating ratio. But we want our group to be open to adult readers, so we'll rule this out.
Women's fiction was the next-highest fiction genre for median reviews and the highest for ratings. A book group for women's fiction is an intriguing choice, but we want our group to be open to all genders, so we'll rule this out.
Mystery thrillers was third for ratings and reviews. Close relatives "suspense" and "thriller" also appear on this list. This could be a good potential choice for our group.
Literary fiction--a broad category for books that aspire to literary merit--was fourth for median reviews and had the second-highest reviews-to-ratings ratio. This means that, though mystery thrillers may reach a larger audience of readers, the readers of literary fiction were about 30% more likely to leave a review, indicating these books can generate a lot of discussion.
We'll recommend targeting the book group at literary fiction readers since this is likely to appeal to a large audience, to generate a high amount of discussion, and will be inclusive of different age groups and genders.
Let's look at the top literary fiction books for reviews from this dataset to get a sense of which specific titles we could recommend for the book group.
#Format the numeric columns
titles_format_dict = {'totalratings':'{:,.0f}'
,'reviews':'{:,.0f}'
,'review_ratio':'{:,.3f}'
}
#Display the top literary fiction for adults sorted by total reviews
literary_fiction = (data_clean_genre_with_new_col['genre'].str.contains('literary fiction', regex=False)) & (~data_clean_genre_with_new_col['genre'].str.contains('young adult', regex=False))
data_clean_genre_with_new_col.loc[literary_fiction, ['title', 'author','bookformat','totalratings','reviews','review_ratio']].sort_values(by='reviews', ascending=False).head(10).style.hide(axis="index").format(titles_format_dict)
| title | author | bookformat | totalratings | reviews | review_ratio |
|---|---|---|---|---|---|
| And the Mountains Echoed | Khaled Hosseini | hardcover | 319,213 | 28,936 | 0.091 |
| Americanah | Chimamanda Ngozi Adichie,Jashar Awan | hardcover | 295,475 | 24,915 | 0.084 |
| Life After Life | Kate Atkinson | hardcover | 207,029 | 24,729 | 0.119 |
| Norwegian Wood | Haruki Murakami,Jay Rubin | paperback | 385,612 | 23,958 | 0.062 |
| The Unlikely Pilgrimage of Harold Fry | Rachel Joyce | hardcover | 157,045 | 18,179 | 0.116 |
| The Sense of an Ending | Julian Barnes | hardcover | 156,819 | 16,481 | 0.105 |
| Rules of Civility | Amor Towles | hardcover | 160,843 | 15,917 | 0.099 |
| Beautiful Ruins | Jess Walter | hardcover | 163,460 | 15,682 | 0.096 |
| Stoner | John Williams,John McGahern | paperback | 113,334 | 12,547 | 0.111 |
| We Are All Completely Beside Ourselves | Karen Joy Fowler | hardcover | 110,850 | 11,696 | 0.106 |
Almost all of these titles (with the exception of Norweigian Wood and Stoner) were hardcover editions published in the 2010s. Therefore we could focus our group on new literary fiction in hardcover editions, since we've seen that hardcovers tend to elicit more reviews. We could see if any of these authors of the top books have new books coming out. For example, Karen Joy Fowler published a new book in March 2022, Julian Barnes published a new book in August 2022, and Kate Atkinson published a new book in September 2022.